Abstract



We propose a novel non-parametric adaptive anomaly detection algorithm for high-dimensional
data based on score functions derived from nearest neighbor graphs on n-point nominal data.
Anomalies are declared whenever the score of a test sample falls below $\alpha$, the desired
false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in
that it is uniformly most powerful for the specified false alarm level, $\alpha$, for the case
when the anomaly density is a mixture of the nominal and a known density. Our algorithm is
computationally efficient, being linear in dimension and quadratic in data size. It does not
require choosing complicated tuning parameters or function approximation classes, and it can
adapt to local structure such as local changes in dimensionality. We demonstrate the algorithm
on both artificial and real data sets in high-dimensional feature spaces.

1 Introduction 

Anomaly detection involves detecting statistically significant deviations of test data from the nominal
distribution. In typical applications the nominal distribution is unknown and generally cannot be
reliably estimated from nominal training data due to a combination of factors such as limited data
size and high dimensionality.

We propose an adaptive non-parametric method for anomaly detection based on score functions that
map data samples to the interval [0, 1]. Our score function is derived from a K-nearest neighbor
graph (K-NNG) on n-point nominal data. An anomaly is declared whenever the score of a test sample
falls below $\alpha$ (the desired false alarm error). The efficacy of our method rests upon its close
connection to multivariate p-values. In statistical hypothesis testing, a p-value is any transformation
of the feature space to the interval [0, 1] that induces a uniform distribution on the nominal data.
When test samples with p-values smaller than $\alpha$ are declared as anomalies, the false alarm error is
less than $\alpha$.

We develop a novel notion of p-values based on measures of level sets of likelihood ratio functions.
Our notion provides a characterization of the optimal anomaly detector in that it is uniformly most
powerful for a specified false alarm level for the case when the anomaly density is a mixture of the
nominal and a known density. We show that our score function is asymptotically consistent, namely,
it converges to our multivariate p-value as the data length approaches infinity.

Anomaly detection has been extensively studied. It is also referred to as novelty detection,
outlier detection Q, one-class classification fl4j [5] and single-class classification [6] in the
literature. Approaches to anomaly detection can be grouped into several categories. In parametric
approaches [7] the nominal densities are assumed to come from a parameterized family and
generalized likelihood ratio tests are used for detecting deviations from nominal. It is difficult to use
parametric approaches when the distribution is unknown and data is limited. A K-nearest neighbor
(K-NN) anomaly detection approach is presented in [3] 0. There an anomaly is declared whenever
the distance to the K-th nearest neighbor of the test sample falls outside a threshold. In comparison,
our anomaly detector utilizes the global information available from the entire K-NN graph to detect
deviations from the nominal. In addition it has provable optimality properties. Learning theoretic
approaches attempt to find decision regions, based on nominal data, that separate nominal instances
from their outliers. These include the one-class SVM of Schölkopf et al. [5], where the basic idea
is to map the training data into the kernel space and to separate them from the origin with maximum
margin. Other algorithms along this line of research include support vector data description
[10], linear programming approach JTJ, and single class minimax probability machine [11]. While
these approaches provide impressive computationally efficient solutions on real data, it is generally
difficult to precisely relate tuning parameter choices to the desired false alarm probability.

Scott and Nowak [12] derive decision regions based on minimum volume (MV) sets, which do
provide Type I and Type II error control. They approximate (in appropriate function classes) level
sets of the unknown nominal multivariate density from training samples. Related work by Hero
[13] based on geometric entropic minimization (GEM) detects outliers by comparing test samples
to the most concentrated subset of points in the training sample. This most concentrated set is the
K-point minimum spanning tree (MST) for n-point nominal data and converges asymptotically to
the minimum entropy set (which is also the MV set). Nevertheless, computing the K-MST for n-point
data is generally intractable. To overcome these computational limitations, [13] proposes heuristic
greedy algorithms based on the leave-one-out K-NN graph, which, while inspired by the K-MST algorithm,
are no longer provably optimal. Our approach is related to these latter techniques, namely, the MV sets
of [12] and the GEM approach of [13]. We develop score functions on the K-NNG which turn out to be the
empirical estimates of the volume of the MV sets containing the test point. The volume, which is a
real number, is a sufficient statistic for ensuring optimal guarantees. In this way we avoid explicit
high-dimensional level set computation. Yet our algorithms lead to statistically optimal solutions
with the ability to control false alarm and miss error probabilities.

The main features of our anomaly detector are summarized. (1) Like flTI our algorithm scales 
linearly with dimension and quadratically with data size and can be applied to high-dimensional feature
spaces. (2) Like [12] our algorithm is provably optimal in that it is uniformly most powerful for
the specified false alarm level, $\alpha$, for the case that the anomaly density is a mixture of the nominal
and any other density (not necessarily uniform). (3) We do not require assumptions of linearity,
smoothness, continuity of the densities or the convexity of the level sets. Furthermore, our algorithm
adapts to the inherent manifold structure or local dimensionality of the nominal density. (4) Like [13]
and unlike other learning theoretic approaches such as [9], [12] we do not require choosing complex
tuning parameters or function approximation classes.

2 Anomaly Detection Algorithm: Score functions based on K-NNG 

In this section we present our basic algorithm devoid of any statistical context. Statistical analysis
appears in Section 3. Let $S = \{x_1, x_2, \ldots, x_n\}$ be the nominal training set of size n belonging to
the unit cube $[0, 1]^d$. For notational convenience we use $\eta$ and $x_{n+1}$ interchangeably to denote a test
point. Our task is to declare whether the test point is consistent with nominal data or deviates from
the nominal data. If the test point is an anomaly it is assumed to come from a mixture of the nominal
distribution underlying the training data and another known density (see Section 3).

Let $d(x, y)$ be a distance function denoting the distance between any two points $x, y \in [0, 1]^d$. For
simplicity we denote the distances by $d_{ij} = d(x_i, x_j)$. In the simplest case we assume the distance
function to be Euclidean. However, we also consider geodesic distances to exploit the underlying
manifold structure. The geodesic distance is defined as the shortest distance on the manifold.
The Geodesic Learning algorithm, a subroutine in Isomap [14, 15], can be used to efficiently and
consistently estimate the geodesic distances. In addition, by selectively weighting different
coordinates, the distance function can also account for pronounced changes in local
dimensionality. This can be accomplished, for instance, through Mahalanobis distances or as a
by-product of local linear embedding [16]. However, we skip these details here and assume that a
suitable distance metric is chosen.
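
As a rough illustration of how geodesic distances might be estimated in the spirit of the Isomap subroutine mentioned above, the sketch below builds a symmetric k-nearest-neighbor graph and runs Dijkstra's algorithm from every node. The function name `geodesic_distances`, the parameter `k`, and the half-circle data are assumptions made for this example and are not taken from the Isomap references.

```python
import heapq
import math


def geodesic_distances(points, k):
    """Approximate geodesic distances by shortest paths on a symmetric k-NN graph."""
    n = len(points)
    graph = [dict() for _ in range(n)]
    for i in range(n):
        # k nearest neighbours of point i under the Euclidean metric
        nbrs = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for w, j in nbrs:
            graph[i][j] = w
            graph[j][i] = w  # symmetrise the graph
    dist = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        # Dijkstra's algorithm with lazy deletion from source s
        dist[s][s] = 0.0
        heap = [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[s][u]:
                continue
            for v, w in graph[u].items():
                nd = d + w
                if nd < dist[s][v]:
                    dist[s][v] = nd
                    heapq.heappush(heap, (nd, v))
    return dist


# 21 points on a half circle of radius 1: the endpoints are 2 apart in
# Euclidean distance but roughly pi apart along the curve.
pts = [(math.cos(math.pi * i / 20), math.sin(math.pi * i / 20)) for i in range(21)]
geo = geodesic_distances(pts, k=2)
print(math.dist(pts[0], pts[20]), geo[0][20])
```

On this toy curve the graph distance between the two endpoints is close to the arc length rather than to the chord, which is the behavior the geodesic metric is meant to capture.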

Once a distance function is defined, our next step is to form a K-nearest neighbor graph (K-NNG) or
alternatively an $\epsilon$-neighbor graph ($\epsilon$-NG). The K-NNG is formed by connecting each $x_i$ to the K closest
points $\{x_{i_1}, \ldots, x_{i_K}\}$ in $S - \{x_i\}$. We then sort the K nearest distances for each $x_i$ in increasing
order $d_{i,i_1} \le \cdots \le d_{i,i_K}$ and denote $R_S(x_i) = d_{i,i_K}$, that is, the distance from $x_i$ to its K-th
nearest neighbor. We construct the $\epsilon$-NG where $x_i$ and $x_j$ are connected if and only if $d_{ij} \le \epsilon$. In this
case we define $N_S(x_i)$ as the degree of point $x_i$ in the $\epsilon$-NG.
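
The two graph statistics can be computed directly from pairwise distances. The following is a minimal brute-force sketch, assuming Euclidean distance and the conventions that a training point is excluded from its own neighborhood and that an edge exists when the distance is at most the radius; the helper names `knn_radius` and `eps_degree` are hypothetical.

```python
import math


def knn_radius(point, data, k, skip_index=None):
    """Distance from `point` to its k-th nearest neighbour in `data`.

    `skip_index` removes one row of `data`, so that a training point is
    not counted as its own neighbour (the set S - {x_i} in the text).
    """
    dists = sorted(
        math.dist(point, x) for j, x in enumerate(data) if j != skip_index
    )
    return dists[k - 1]


def eps_degree(point, data, eps, skip_index=None):
    """Number of points of `data` within distance `eps` of `point`."""
    return sum(
        1
        for j, x in enumerate(data)
        if j != skip_index and math.dist(point, x) <= eps
    )


S = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]
R = [knn_radius(x, S, k=2, skip_index=i) for i, x in enumerate(S)]
N = [eps_degree(x, S, eps=2.0, skip_index=i) for i, x in enumerate(S)]
print(R)  # 2-NN radius of each training point
print(N)  # degree of each training point in the 2.0-neighbour graph
```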

For the simple case when the anomalous density is an arbitrary mixture of nominal and uniform
density, we consider the following two score functions associated with the two graphs K-NNG and
$\epsilon$-NG respectively. The score functions map the test data $\eta$ to the interval [0, 1].

$$\text{K-LPE:}\quad \hat{p}_K(\eta) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{R_S(\eta) \le R_S(x_i)\}} \tag{1}$$

$$\epsilon\text{-LPE:}\quad \hat{p}_\epsilon(\eta) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{N_S(\eta) \ge N_S(x_i)\}} \tag{2}$$

where $\mathbb{I}_{\{\cdot\}}$ is the indicator function.

Finally, given a pre-defined significance level $\alpha$ (e.g., 0.05), we declare $\eta$ to be anomalous if
$\hat{p}_K(\eta), \hat{p}_\epsilon(\eta) \le \alpha$. We call this algorithm the Localized p-value Estimation (LPE) algorithm. This
choice is motivated by its close connection to multivariate p-values (see Section 3).

The score function K-LPE (or $\epsilon$-LPE) measures the relative concentration of point $\eta$ compared to
the training set. Section 3 establishes that the scores for nominally generated data are asymptotically
uniformly distributed in [0, 1]. Scores for anomalous data are clustered around 0. Hence when scores
below level $\alpha$ are declared as anomalous, the false alarm error is smaller than $\alpha$ asymptotically (since
the integral of a uniform distribution from 0 to $\alpha$ is $\alpha$).
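
The K-LPE score can be sketched in a few lines. The example below is illustrative only: it assumes Euclidean distance, a non-strict comparison of the K-NN radii, and synthetic Gaussian training data that are not one of the paper's experiments; `kth_nn_distance` and `k_lpe_score` are hypothetical names.

```python
import math
import random


def kth_nn_distance(point, data, k, skip_index=None):
    """Distance from `point` to its k-th nearest neighbour in `data`."""
    dists = sorted(
        math.dist(point, x) for j, x in enumerate(data) if j != skip_index
    )
    return dists[k - 1]


def k_lpe_score(eta, train, k, train_radii=None):
    """Fraction of training points whose K-NN radius is at least that of eta."""
    if train_radii is None:
        train_radii = [
            kth_nn_distance(x, train, k, skip_index=i) for i, x in enumerate(train)
        ]
    r_eta = kth_nn_distance(eta, train, k)
    return sum(1 for r in train_radii if r_eta <= r) / len(train)


rng = random.Random(0)
train = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
# Training radii are computed once and reused for every test point.
radii = [kth_nn_distance(x, train, 6, skip_index=i) for i, x in enumerate(train)]

score_center = k_lpe_score((0.0, 0.0), train, 6, radii)  # dense region
score_far = k_lpe_score((8.0, 8.0), train, 6, radii)     # far from all data
print(score_center, score_far)
```

A test point in the dense part of the nominal sample receives a large score, while a point far from all training data receives a score of zero and is flagged at any level $\alpha$.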




Figure 1: Left: Level sets of the nominal bivariate Gaussian mixture distribution used to illustrate the
K-LPE algorithm. Middle: Results of K-LPE with K = 6 and Euclidean distance metric for m = 150 test
points drawn from an equal mixture of 2D uniform and the (nominal) bivariate distributions. Scores for the test
points are based on 200 nominal training samples. Scores falling below a threshold level 0.05 are declared as
anomalies. The dotted contour corresponds to the exact bivariate Gaussian density level set at level $\alpha = 0.05$.
Right: The empirical distribution of the test point scores associated with the bivariate Gaussian appears to be
uniform while scores for the test points drawn from the 2D uniform distribution cluster around zero.

Figure 1 illustrates the use of the K-LPE algorithm for anomaly detection when the nominal data is a
2D Gaussian mixture. The middle panel of Figure 1 shows that the detection results based on K-LPE are
consistent with the theoretical contour for significance level $\alpha = 0.05$. The right panel of Figure 1
shows the empirical distribution (derived from kernel density estimation) of the score function
K-LPE for the nominal (solid blue) and the anomaly (dashed red) data. We can see that the curve for
the nominal data is approximately uniform in the interval [0, 1] and the curve for the anomaly data
has a peak at 0. Therefore choosing the threshold $\alpha = 0.05$ will approximately control the Type I
error within 0.05 and minimize the Type II error. We also take note of the inherent robustness of our
algorithm. As seen from the figure (right), small changes in $\alpha$ lead to small changes in actual false
alarm and miss levels.



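When the mixing density is not uniform but, say, $f_1$, the score functions must be modified so that the
compared quantities are weighted by $f_1$; for the $\epsilon$-NG, for instance,
$\hat{p}_\epsilon(\eta) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{N_S(\eta)/f_1(\eta) \ge N_S(x_i)/f_1(x_i)\}}$, with a corresponding modification of $\hat{p}_K(\eta)$ for the K-NNG.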
To summarize the above discussion, our LPE algorithm has three steps:

(1) Inputs: Significance level $\alpha$, distance metric (Euclidean, geodesic, weighted etc.).

(2) Score computation: Construct the K-NNG (or $\epsilon$-NG) based on $d_{ij}$ and compute the score function
K-LPE from Equation (1) (or $\epsilon$-LPE from Equation (2)).

(3) Make Decision: Declare $\eta$ to be anomalous if and only if $\hat{p}_K(\eta) \le \alpha$ (or $\hat{p}_\epsilon(\eta) \le \alpha$).
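
The three steps above can be sketched end to end for the neighbor-count variant of the score. This is a minimal illustration rather than a reference implementation: the function names, the choice of radius, and the synthetic training cloud are assumptions made for the example.

```python
import math
import random


def degree(point, data, eps, skip_index=None):
    """Number of points of `data` within distance `eps` of `point`."""
    return sum(
        1
        for j, x in enumerate(data)
        if j != skip_index and math.dist(point, x) <= eps
    )


def eps_lpe_detect(test_points, train, alpha, eps):
    """Return (score, is_anomaly) for each test point.

    Step 1: inputs are the level `alpha` and the Euclidean metric.
    Step 2: degrees in the eps-neighbour graph give the score.
    Step 3: a point is flagged when its score is at most `alpha`.
    """
    train_deg = [degree(x, train, eps, skip_index=i) for i, x in enumerate(train)]
    results = []
    for eta in test_points:
        n_eta = degree(eta, train, eps)
        score = sum(1 for nd in train_deg if n_eta >= nd) / len(train)
        results.append((score, score <= alpha))
    return results


rng = random.Random(0)
train = [(rng.gauss(0.5, 0.1), rng.gauss(0.5, 0.1)) for _ in range(200)]
results = eps_lpe_detect([(0.5, 0.5), (0.95, 0.05)], train, alpha=0.05, eps=0.1)
for score, flagged in results:
    print(round(score, 3), flagged)
```

The point at the center of the training cloud is kept as nominal, while the point in the empty corner of the unit square is declared anomalous.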

Computational Complexity: Computing each pairwise distance requires $O(d)$ operations, and
$O(n^2 d)$ operations are required for all the nodes in the training set. In the worst case, computing the K-NN graph
(for small K) and the functions $R_S(\cdot)$, $N_S(\cdot)$ requires $O(n^2)$ operations over all the nodes in the
training data. Finally, computing the score for each test point requires $O(nd + n)$ operations (given that
$R_S(\cdot)$, $N_S(\cdot)$ have already been computed).

Remark: LPE is fundamentally different from non-parametric density estimation or level set
estimation schemes (e.g., MV-set). These approaches involve explicit estimation of high-dimensional
quantities and are thus hard to apply in high-dimensional problems. By computing scores for each test
sample we avoid high-dimensional computation. Furthermore, as we will see in the following
section, the scores are estimates of multivariate p-values. These turn out to be sufficient statistics for
optimal anomaly detection.

3 Theory: Consistency of LPE 

A statistical framework for the anomaly detection problem is presented in this section. We establish
that anomaly detection is equivalent to thresholding p-values for multivariate data. We will then
show that the score functions developed in the previous section are asymptotically consistent
estimators of the p-values. Consequently, it will follow that the strategy of declaring an anomaly when a
test sample has a low score is asymptotically optimal.

Assume that the data belongs to the d-dimensional unit cube $[0, 1]^d$ and the nominal data is sampled
from a multivariate density $f_0(x)$ supported on the d-dimensional unit cube $[0, 1]^d$. Anomaly
detection can be formulated as a composite hypothesis testing problem. Suppose test data $\eta$ comes
from a mixture distribution, namely, $f(\eta) = (1 - \pi) f_0(\eta) + \pi f_1(\eta)$ where $f_1(\eta)$ is a mixing density
supported on $[0, 1]^d$. Anomaly detection involves testing the nominal hypothesis $H_0 : \pi = 0$ versus
the alternative (anomaly) $H_1 : \pi > 0$. The goal is to maximize the detection power subject to false
alarm level $\alpha$, namely, $\mathbb{P}(\text{declare } H_1 \mid H_0) \le \alpha$.

Definition 1. Let $\mathcal{P}_0$ be the nominal probability measure and $f_1(\cdot)$ be $\mathcal{P}_0$ measurable. Suppose the
likelihood ratio $f_1(x)/f_0(x)$ does not have non-zero flat spots on any open ball in $[0, 1]^d$. Define
the p-value of a data point $\eta$ as

$$p(\eta) = \mathcal{P}_0\left( x : \frac{f_1(x)}{f_0(x)} \ge \frac{f_1(\eta)}{f_0(\eta)} \right)$$

Note that the definition naturally accounts for singularities which may arise if the support of $f_0(\cdot)$
is a lower dimensional manifold. In this case we encounter $f_1(\eta) > 0$, $f_0(\eta) = 0$ and the p-value
$p(\eta) = 0$. Here an anomaly is always declared (low score).

The above formula can be thought of as a mapping $\eta \to [0, 1]$. Furthermore, the distribution of
$p(\eta)$ under $H_0$ is uniform on [0, 1]. However, as noted in the introduction there are other such
transformations. To build intuition about the above transformation and its utility consider the following
example. When the mixing density is uniform, namely, $f_1(\eta) = U(\eta)$ where $U(\eta)$ is uniform over
$[0, 1]^d$, note that $\Omega_\alpha = \{\eta \mid p(\eta) \ge \alpha\}$ is a density level set at level $\alpha$. It is well known (see [12])
that such a density level set is equivalent to a minimum volume set of level $\alpha$. The minimum volume
set at level $\alpha$ is known to be the uniformly most powerful decision region for testing $H_0 : \pi = 0$
versus the alternative $H_1 : \pi > 0$ (see [13, 12]). The generalization to arbitrary $f_1$ is described next.
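
A small numerical check can make the uniform-mixing example concrete. The sketch below reads the p-value, for uniform $f_1$, as the nominal probability of the region where the nominal density is no larger than at the test point. As a stated assumption it uses a standard bivariate Gaussian as the nominal density and ignores the unit-cube restriction, because in that case the p-value has the closed form $\exp(-\|\eta\|^2/2)$. All names are hypothetical.

```python
import math
import random


def gaussian_density(x):
    """Standard bivariate Gaussian density (assumed nominal density f0)."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def p_value_mc(eta, sampler, density, n_samples, rng):
    """Monte Carlo estimate of P0(x : f0(x) <= f0(eta))."""
    f_eta = density(eta)
    hits = sum(1 for _ in range(n_samples) if density(sampler(rng)) <= f_eta)
    return hits / n_samples


def sample_nominal(rng):
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


rng = random.Random(1)
eta = (1.0, 1.0)
estimate = p_value_mc(eta, sample_nominal, gaussian_density, 20000, rng)
exact = math.exp(-0.5 * (eta[0] ** 2 + eta[1] ** 2))  # closed form for this f0
print(estimate, exact)
```

The Monte Carlo estimate agrees with the closed form up to sampling error, and points farther from the mode receive smaller p-values.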

Theorem 1. The uniformly most powerful test for testing $H_0 : \pi = 0$ versus the alternative
(anomaly) $H_1 : \pi > 0$ at a prescribed level $\alpha$ of significance $\mathbb{P}(\text{declare } H_1 \mid H_0) \le \alpha$ is:

$$\phi(\eta) = \begin{cases} H_1, & p(\eta) \le \alpha \\ H_0, & \text{otherwise} \end{cases}$$

Proof. We provide the main idea for the proof. First, measure theoretic arguments are used to
establish $p(X)$ as a random variable over [0, 1] under both nominal and anomalous distributions.
Next, when $X \sim f_0$, i.e., distributed with nominal density, it follows that the random variable
$p(X) \sim U[0, 1]$. When $X \sim f = (1 - \pi) f_0 + \pi f_1$ with $\pi > 0$, the random variable $p(X) \sim g$ where $g(\cdot)$
is a monotonically decreasing PDF supported on [0, 1]. Consequently, the uniformly most powerful
test for a significance level $\alpha$ is to declare p-values smaller than $\alpha$ as anomalies. □

Next we derive the relationship between the p-values and our score function. By definition, $R_S(\eta)$
and $R_S(x_i)$ are correlated because the neighborhoods of $\eta$ and $x_i$ might overlap. We modify our
algorithm to simplify our analysis. We assume n is odd (say) and can be written as $n = 2m + 1$.
We divide the training set S into two parts:

$$S = S_1 \cup S_2 = \{x_0, x_1, \ldots, x_m\} \cup \{x_{m+1}, \ldots, x_{2m}\}$$

We modify $\epsilon$-LPE to $\hat{p}_\epsilon(\eta) = \frac{1}{m} \sum_{x_i \in S_1} \mathbb{I}_{\{N_{S_2}(\eta) \ge N_{S_1}(x_i)\}}$ (or K-LPE to
$\hat{p}_K(\eta) = \frac{1}{m} \sum_{x_i \in S_1} \mathbb{I}_{\{R_{S_2}(\eta) \le R_{S_1}(x_i)\}}$). Now $R_{S_2}(\eta)$ and $R_{S_1}(x_i)$ are independent.
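
The sample-splitting device can be sketched as follows. This is an illustration under stated assumptions: Euclidean distance, leave-one-out radii within the first half, and normalization by the number of points in the first half, which is a choice made for this example; `split_k_lpe_score` is a hypothetical name.

```python
import math
import random


def kth_nn_distance(point, data, k, skip_index=None):
    """Distance from `point` to its k-th nearest neighbour in `data`."""
    dists = sorted(
        math.dist(point, x) for j, x in enumerate(data) if j != skip_index
    )
    return dists[k - 1]


def split_k_lpe_score(eta, train, k):
    """Split-sample score: the test radius uses S2, training radii use S1."""
    m = len(train) // 2
    s1, s2 = train[: m + 1], train[m + 1 :]
    r_eta = kth_nn_distance(eta, s2, k)  # depends on S2 only
    radii = [kth_nn_distance(x, s1, k, skip_index=i) for i, x in enumerate(s1)]
    return sum(1 for r in radii if r_eta <= r) / len(s1)


rng = random.Random(2)
train = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(201)]
split_center = split_k_lpe_score((0.0, 0.0), train, 6)
split_far = split_k_lpe_score((8.0, 8.0), train, 6)
print(split_center, split_far)
```

Because the test radius and the training radii are computed from disjoint halves, the two quantities being compared no longer share neighbors, at the cost of using only half of the data for each.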

Furthermore, we assume $f_0(\cdot)$ satisfies the following two smoothness conditions:

1. The Hessian matrix $H(x)$ of $f_0(x)$ is always dominated by a matrix with largest eigenvalue
$\lambda_M$, i.e., $\exists M$ s.t. $H(x) \preceq M \ \forall x$ and $\lambda_{\max}(M) \le \lambda_M$.

2. In the support of $f_0(\cdot)$, its value is always lower bounded by some $\beta > 0$.

We have the following theorem.

Theorem 2. Consider the setup above with the training data $\{x_i\}_{i=1}^{n}$ generated i.i.d. from $f_0(x)$.
Let $\eta \in [0, 1]^d$ be an arbitrary test sample. It follows that for a suitable choice of K and under the
above smoothness conditions,

$$|\hat{p}_K(\eta) - p(\eta)| \xrightarrow{n \to \infty} 0 \quad \text{almost surely}, \ \forall \eta \in [0, 1]^d$$

For simplicity, we limit ourselves to the case when $f_1$ is uniform. The proof of Theorem 2 consists
of two steps:

• We show that the expectation $\mathbb{E}_{S_1}[\hat{p}_\epsilon(\eta)] \xrightarrow{n \to \infty} p(\eta)$ (Lemma 3). This result is then
extended to K-LPE (i.e. $\mathbb{E}_{S_1}[\hat{p}_K(\eta)] \xrightarrow{n \to \infty} p(\eta)$) in Lemma 4.

• Next we show that $\hat{p}_K(\eta) \xrightarrow{n \to \infty} \mathbb{E}_{S_1}[\hat{p}_K(\eta)]$ via a concentration inequality (Lemma 5).

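Lemma 3 ($\epsilon$-LPE). By picking $\epsilon = m^{-\frac{3}{5d}} \sqrt{\frac{d}{2\pi e}}$, with probability at least $1 - e^{-\beta m^{1/15}/2}$,

$$l_m(\eta) \le \mathbb{E}_{S_1}[\hat{p}_\epsilon(\eta)] \le u_m(\eta) \tag{3}$$

where

$$l_m(\eta) = \mathcal{P}_0\{x : (f_0(\eta) - \Delta_1)(1 - \Delta_2) \ge (f_0(x) + \Delta_1)(1 + \Delta_2)\} - e^{-\beta m^{1/15}/2}$$

$$u_m(\eta) = \mathcal{P}_0\{x : (f_0(\eta) + \Delta_1)(1 + \Delta_2) \ge (f_0(x) - \Delta_1)(1 - \Delta_2)\} + e^{-\beta m^{1/15}/2}$$

$\Delta_1 = \lambda_M m^{-\frac{6}{5d}} d/(2\pi e(d + 2))$ and $\Delta_2 = 2 m^{-1/6}$.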

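Proof. We only prove the lower bound since the upper bound follows along similar lines. By
interchanging the expectation with the summation,

$$\mathbb{E}_{S_1}[\hat{p}_\epsilon(\eta)] = \mathbb{E}_{S_1}\left[\frac{1}{m} \sum_{x_i \in S_1} \mathbb{I}_{\{N_{S_2}(\eta) \ge N_{S_1}(x_i)\}}\right] = \frac{1}{m} \sum_{x_i \in S_1} \mathbb{E}_{x_i}\left[\mathbb{P}_{S_1 \setminus x_i}\left(N_{S_2}(\eta) \ge N_{S_1}(x_i)\right)\right] \ge \mathbb{E}_{x_1}\left[\mathbb{P}_{S_1 \setminus x_1}\left(N_{S_2}(\eta) \ge N_{S_1}(x_1)\right)\right]$$

where the last inequality follows from the symmetric structure of $\{x_0, x_1, \ldots, x_m\}$.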

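Clearly the objective of the proof is to show $\mathbb{P}_{S_1 \setminus x_1}(N_{S_2}(\eta) \ge N_{S_1}(x_1)) \xrightarrow{n \to \infty} \mathbb{I}_{\{f_0(\eta) \ge f_0(x_1)\}}$.
Skipping technical details, this can be accomplished in two steps. (1) Note that $N_S(x_1)$ is a binomial
random variable with success probability $q(x_1) := \int_{B_\epsilon} f_0(x_1 + t)\,dt$. This relates
$\mathbb{P}_{S_1 \setminus x_1}(N_{S_2}(\eta) \ge N_{S_1}(x_1))$ to $\mathbb{I}_{\{q(\eta) \ge q(x_1)\}}$. (2) We relate $\mathbb{I}_{\{q(\eta) \ge q(x_1)\}}$ to $\mathbb{I}_{\{f_0(\eta) \ge f_0(x_1)\}}$ based on the function
smoothness condition. The details of these two steps are shown below.

Note that $N_{S_1}(x_1) \sim \mathrm{Binom}(m, q(x_1))$. By the Chernoff bound for the binomial distribution, we have

$$\mathbb{P}_{S_1 \setminus x_1}\left(N_{S_1}(x_1) - m q(x_1) \ge \delta\right) \le e^{-\frac{\delta^2}{2 m q(x_1)}}$$

that is, $N_{S_1}(x_1)$ is concentrated around $m q(x_1)$. This implies

$$\mathbb{P}_{S_1 \setminus x_1}\left(N_{S_2}(\eta) \ge N_{S_1}(x_1)\right) \ge \mathbb{I}_{\{N_{S_2}(\eta) \ge m q(x_1) + \delta_{x_1}\}} - e^{-\frac{\delta_{x_1}^2}{2 m q(x_1)}} \tag{4}$$

We choose $\delta_{x_1} = q(x_1) m^{\gamma}$ ($\gamma$ will be specified later) and reformulate equation (4) as

$$\mathbb{P}_{S_1 \setminus x_1}\left(N_{S_2}(\eta) \ge N_{S_1}(x_1)\right) \ge \mathbb{I}_{\left\{\frac{N_{S_2}(\eta)}{m \mathrm{Vol}(B_\epsilon)} \ge \frac{q(x_1)}{\mathrm{Vol}(B_\epsilon)}\left(1 + m^{\gamma - 1}\right)\right\}} - e^{-\frac{q(x_1) m^{2\gamma - 1}}{2}} \tag{5}$$

Next, we relate $q(x_1)$ (or $\int_{B_\epsilon} f_0(x_1 + t)\,dt$) to $f_0(x_1)$ via Taylor's expansion and the smoothness
condition of $f_0$, where the second-order term is controlled through $\lambda_M \int_{B_\epsilon} \|t\|^2\,dt$:

$$\left| \frac{\int_{B_\epsilon} f_0(x_1 + t)\,dt}{\mathrm{Vol}(B_\epsilon)} - f_0(x_1) \right| \le \Delta_1 \tag{6}$$

and then equation (5) becomes

$$\mathbb{P}_{S_1 \setminus x_1}\left(N_{S_2}(\eta) \ge N_{S_1}(x_1)\right) \ge \mathbb{I}_{\left\{\frac{N_{S_2}(\eta)}{m \mathrm{Vol}(B_\epsilon)} \ge (f_0(x_1) + \Delta_1)\left(1 + m^{\gamma - 1}\right)\right\}} - e^{-\frac{q(x_1) m^{2\gamma - 1}}{2}}$$

By applying the same steps to $N_{S_2}(\eta)$ as in equation (4) (Chernoff bound) and equation (6) (Taylor's
expansion), and by choosing $\epsilon$ as in the statement of the lemma and $\gamma = 5/6$ (so that $m^{\gamma - 1} = m^{-1/6}$),
we have with probability at least $1 - e^{-\beta m^{1/15}/2}$,

$$\mathbb{E}_{x_1}\left[\mathbb{P}_{S_1 \setminus x_1}\left(N_{S_2}(\eta) \ge N_{S_1}(x_1)\right)\right] \ge \mathcal{P}_0\{x_1 : (f_0(\eta) - \Delta_1)(1 - \Delta_2) \ge (f_0(x_1) + \Delta_1)(1 + \Delta_2)\} - e^{-\beta m^{1/15}/2}$$

which proves the lemma. □
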
Lemma 4 (K-LPE). By picking $K = (1 - 2m^{-1/6})\, m^{2/5} (f_0(\eta) - \Delta_1)$, with probability at least
$1 - e^{-\beta m^{1/15}/2}$,

$$l_m(\eta) \le \mathbb{E}_{S_1}[\hat{p}_K(\eta)] \le u_m(\eta) \tag{7}$$

Proof. The proof is very similar to the proof of Lemma 3 and we only give a brief outline here. Now
the objective is to show $\mathbb{P}_{S_1 \setminus x_1}(R_{S_2}(\eta) \le R_{S_1}(x_1)) \xrightarrow{n \to \infty} \mathbb{I}_{\{f_0(\eta) \ge f_0(x_1)\}}$. The basic idea is to use
the result of Lemma 3. To accomplish this, we note that $\{R_{S_2}(\eta) \le R_{S_1}(x_1)\}$ contains the events
$\{N_{S_2}(\eta) \ge K\} \cap \{N_{S_1}(x_1) \le K\}$, or equivalently

$$\{N_{S_2}(\eta) - q(\eta) m \ge K - q(\eta) m\} \cap \{N_{S_1}(x_1) - q(x_1) m \le K - q(x_1) m\} \tag{8}$$

By the tail probability of the binomial distribution, the probability of the above two events converges to
1 exponentially fast if $K - q(\eta) m < 0$ and $K - q(x_1) m > 0$. By using the same two-step bounding
techniques developed in the proof of Lemma 3, these two inequalities are implied by

$$K - m^{2/5}(f_0(\eta) - \Delta_1) < 0 \quad \text{and} \quad K - m^{2/5}(f_0(x_1) + \Delta_1) > 0$$

Therefore if we choose $K = (1 - 2m^{-1/6})\, m^{2/5} (f_0(\eta) - \Delta_1)$, we have with probability at least
$1 - e^{-\beta m^{1/15}/2}$,

$$\mathbb{P}_{S_1 \setminus x_1}\left(R_{S_2}(\eta) \le R_{S_1}(x_1)\right) \ge \mathbb{I}_{\{(f_0(\eta) - \Delta_1)(1 - \Delta_2) \ge (f_0(x_1) + \Delta_1)(1 + \Delta_2)\}} - e^{-\beta m^{1/15}/2}$$

□

Remark: Lemma 3 and Lemma 4 were proved with specific choices for $\epsilon$ and K. However, $\epsilon$ and K
can be chosen in a range of values, which will lead to different lower and upper bounds. We will show
in Section 4 through simulations that our LPE algorithm is generally robust to the choice of the parameter
K.

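Lemma 5. Suppose $K = c m^{2/5}$ and denote $\hat{p}_K(\eta) = \frac{1}{m} \sum_{x_i \in S_1} \mathbb{I}_{\{R_{S_2}(\eta) \le R_{S_1}(x_i)\}}$. We have

$$\mathbb{P}\left( \left|\mathbb{E}_{S_1}[\hat{p}_K(\eta)] - \hat{p}_K(\eta)\right| > \delta \right) \le 2 e^{-\frac{2 \delta^2 m^{1/5}}{c^2 \gamma_d^2}}$$

where $\gamma_d$ is a constant and is defined as the minimal number of cones centered at the origin of angle
$\pi/6$ that cover $\mathbb{R}^d$.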

Proof. We cannot apply the law of large numbers in this case because the $\mathbb{I}_{\{R_{S_2}(\eta) \le R_{S_1}(x_i)\}}$ are
correlated. Instead, we need to use a more general concentration-of-measure inequality such
as McDiarmid's inequality [17]. Denote $F(x_0, \ldots, x_m) = \frac{1}{m} \sum_{x_i \in S_1} \mathbb{I}_{\{R_{S_2}(\eta) \le R_{S_1}(x_i)\}}$. From
Corollary 11.1 in [18],

$$\sup_{x_0, \ldots, x_m, x_i'} \left| F(x_0, \ldots, x_i, \ldots, x_m) - F(x_0, \ldots, x_i', \ldots, x_m) \right| \le K \gamma_d / m \tag{9}$$

Then the lemma directly follows from applying McDiarmid's inequality. □
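
To see where a bound of this type comes from, the following worked step may help. It is a sketch under the assumption that the bounded-difference constant in (9) applies to each of the m coordinates of F. McDiarmid's inequality states that $\mathbb{P}(|F - \mathbb{E}F| > \delta) \le 2\exp(-2\delta^2 / \sum_i c_i^2)$ when changing the i-th argument changes F by at most $c_i$. With $c_i = K\gamma_d/m$ and $K = c m^{2/5}$:

$$\sum_{i=1}^{m} c_i^2 = m \cdot \frac{K^2 \gamma_d^2}{m^2} = \frac{c^2 \gamma_d^2\, m^{4/5}}{m} = c^2 \gamma_d^2\, m^{-1/5}, \qquad \text{so} \qquad \mathbb{P}\left(|F - \mathbb{E}F| > \delta\right) \le 2\exp\left(-\frac{2\delta^2 m^{1/5}}{c^2 \gamma_d^2}\right)$$

The right-hand side is summable in m for every fixed $\delta > 0$, which is the property a Borel-Cantelli argument requires.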

Theorem 2 directly follows from the combination of Lemma 4 and Lemma 5 and a standard application
of the first Borel-Cantelli lemma. We have used Euclidean distance in Theorem 2. When the
support of $f_0$ lies on a lower dimensional manifold (say $d' < d$), adopting the geodesic metric leads
to faster convergence. It turns out that $d'$ replaces d in the expression for $\Delta_1$ in Lemma 3.

4 Experiments 

We apply our method on both artificial and real-world data. Our method enables plotting the entire 
ROC curve by varying the thresholds on our scores. 

To test the sensitivity of K-LPE to parameter changes, we first run K-LPE on the benchmark artificial
data set Banana [19] with K varying from 2 to 12. The Banana dataset contains points with
their labels (+1 or -1). We randomly pick 109 points with +1 label and regard them as the nominal
training data. The test data comprises 108 +1 points and 183 -1 points (ground truth), and the algorithm
is supposed to predict +1 data as "nominal" and -1 data as "anomaly". See Figure 2(a) for
the configuration of the training points and test points. Scores computed for the test set using Equation
(1) are oblivious to the true $f_1$ density (-1 labels). The Euclidean distance metric is adopted for this example.

False alarm (also called false positive) is defined as the percentage of nominal points that are predicted
as anomalies by the algorithm. To control false alarm at level $\alpha$, a point with score $\le \alpha$ is
predicted as an anomaly. Empirical false alarm and true positive rates (the percentage of anomalies declared as
anomalies) can be computed from ground truth. We vary $\alpha$ to obtain the empirical ROC curve. We
follow this procedure for all the other experiments in this section. Our method is relatively insensitive to K,
as shown in Figure 2(b).
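
The ROC procedure described above amounts to sweeping the threshold over precomputed scores. The following is a minimal sketch in which a point is flagged when its score is at or below the threshold; the function name `empirical_roc` and the toy scores are assumptions for illustration and are not the Banana results.

```python
def empirical_roc(scores, is_anomaly, alphas):
    """Return (alpha, false_alarm_rate, true_positive_rate) for each alpha."""
    nominal = [s for s, a in zip(scores, is_anomaly) if not a]
    anomalous = [s for s, a in zip(scores, is_anomaly) if a]
    curve = []
    for alpha in alphas:
        # A point is declared anomalous when its score is at most alpha.
        fa = sum(1 for s in nominal if s <= alpha) / len(nominal)
        tp = sum(1 for s in anomalous if s <= alpha) / len(anomalous)
        curve.append((alpha, fa, tp))
    return curve


# Toy scores: four nominal test points followed by four anomalies.
scores = [0.02, 0.40, 0.75, 0.90, 0.01, 0.03, 0.30, 0.00]
labels = [False, False, False, False, True, True, True, True]
roc = empirical_roc(scores, labels, [0.05, 0.5, 1.0])
for alpha, fa, tp in roc:
    print(alpha, fa, tp)
```

Ground-truth labels are used only to evaluate the curve; the scores themselves are computed from nominal training data alone.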

For comparison we plot the empirical ROC curve of the one-class SVM of [9]. There are two tuning
parameters in OC-SVM: the bandwidth c (we use an RBF kernel) and $\nu \in (0, 1)$ (which is supposed
to control FA). Note that the training data does not contain -1 labels, which implies that we can never
make use of -1 labels to cross-validate or to optimize over the choice of the pair $(c, \nu)$. In our OC-SVM
implementation, by following the same procedure, we can obtain the empirical ROC curve by
varying $\nu$ but fixing a certain bandwidth c. Finally we iterated over different c to obtain the best (in
terms of AUC) ROC curve, which turns out to be c = 1.5. Fixing c for the entire ROC is equivalent to
fixing K in our score function. Note that in real practice what can be done is even worse than this
implementation because there is also no natural way to optimize over c without access to the
-1 labels.

In Figure 2(b), we can see that our algorithm is consistently better than one-class SVM on the
Banana dataset. Furthermore, we found that choosing suitable tuning parameters to control false
alarms is generally difficult in the one-class SVM approach. In our approach if we set $\alpha = 0.05$ we
get empirical FA = 0.06 and for $\alpha = 0.08$, empirical FA = 0.09. For OC-SVM we cannot see
any natural way of picking c and $\nu$ to control FA rate based only on training data.

(a) Configuration of banana data (b) SVM vs. K-LPE for Banana Data

Figure 2: Performance Robustness of LPE; (a) The configuration of the nominal training points (red '+') and
unlabeled test points (black '•') for the banana dataset [19]; (b) Empirical ROC curve of K-LPE on the
banana dataset with K = 2, 4, 6, 8, 10, 12 (with n = 400) vs the empirical ROC curve of one class SVM
developed in [9].

Figure 3: Clairvoyant ROC curve vs. K-LPE; (a) Configuration of the nominal training points and unlabeled
test points for the data given by Equation (10); (b) Averaged (over 15 trials) empirical ROC curves of the K-LPE
algorithm vs the clairvoyant ROC curve (when $f_0$ is given by Equation (10)) for K = 6 and for different values of
n (n = 40, 160).

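In Figure 3, we apply our K-LPE to another 2D artificial example where the nominal distribution $f_0$
is a Gaussian mixture and the anomalous distribution is very close to uniform (see Figure 3(a) for
their configuration):

$$f_0 \sim \cdots\,\mathcal{N}\left(\cdots, \begin{bmatrix} 1 & 0 \\ 0 & 9 \end{bmatrix}\right) + \cdots\,\mathcal{N}\left(\cdots, \begin{bmatrix} 1 & 0 \\ 0 & 9 \end{bmatrix}\right), \qquad f_1 \sim \mathcal{N}\left(0, \begin{bmatrix} 49 & 0 \\ 0 & 49 \end{bmatrix}\right) \tag{10}$$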

In this example, we can exactly compute the optimal ROC curve. We call this curve the Clairvoyant
ROC (the red dashed curve in Figure 3(b)). The other two curves are averaged (over 15 trials)
empirical ROC curves with respect to different sizes of training sample (n = 40, 160) for K = 6.
Larger n results in a better ROC curve. We see that for a relatively small training set of size 160 the
average empirical ROC curve is very close to the clairvoyant ROC curve.

Next, we ran LPE on three real-world datasets: Wine, Ionosphere [20] and MNIST US Postal
Service (USPS) database of handwritten digits. The procedure and setup of the experiments are
almost the same as those of the Banana data set. However, there are two differences. (1) If the
number of different labels is greater than two, we always treat points with one particular label as
nominal (+1) and regard the points with other labels as anomalous (-1). For example, for the USPS
dataset, we regard instances of digit 0 as nominal training samples and instances of digits 1, ..., 9
as anomalies. (2) For high-dimensional data sets, the data points are normalized to be within $[0, 1]^d$
and we use the geodesic distance [14] (instead of Euclidean distance) as the input to LPE.




(a) Wine (b) Ionosphere (c) USPS

Figure 4: ROC curves on real datasets via LPE; (a) Wine dataset with D = 13, n = 39, $\epsilon$ = 0.9; (b)
Ionosphere dataset with D = 34, n = 175, K = 9; (c) USPS dataset with D = 256, n = 400, K = 9.

The ROC curves of these three datasets are shown in Figure 4. In the Wine dataset, the dimension of the
feature space is 13. The training set is composed of 39 data points and we apply the $\epsilon$-LPE algorithm
with $\epsilon = 0.9$. The test set is a mixture of 20 nominal points and 158 anomaly points (ground truth).
In the Ionosphere dataset, the dimension of the feature space is 34. The training set is composed
of 175 data points and we apply the K-LPE algorithm with K = 9. The test set is a mixture of
50 nominal points and 126 anomaly points (ground truth). In the USPS dataset, the dimension of the
feature space is 16 x 16 = 256. The training set is composed of 400 data points and we apply the
K-LPE algorithm with K = 9. The test set is a mixture of 367 nominal points and 33 anomaly
points (ground truth).

For comparison purposes we note that for the USPS data set by setting $\alpha = 0.5$ we get empirical
false-positive 6.1% and empirical false alarm rate 5.7% (in contrast, FP = 7% and FA = 9%
with $\nu = 5\%$ for OC-SVM as reported in [9]). Practically we find that K-LPE is preferable
to $\epsilon$-LPE because the parameter K is easier to choose. We find that the value of K is relatively
independent of dimension d. As a rule of thumb we found that setting K around $n^{2/5}$ was generally
effective.
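
As a quick illustration of this rule of thumb, the helper below rounds the two-fifths power of n to the nearest integer. The rounding and the lower bound of 1 are assumptions made for this sketch; the text only says "around", and `rule_of_thumb_k` is a hypothetical name.

```python
def rule_of_thumb_k(n):
    """Suggested K for n nominal training points: n**(2/5), rounded."""
    return max(1, round(n ** 0.4))


for n in (32, 175, 400, 1024):
    print(n, rule_of_thumb_k(n))
```

For the training set sizes used above (n = 175 and n = 400) this gives values close to the K = 9 used in the experiments.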



5 Conclusion 



In this paper, we proposed a novel non-parametric adaptive anomaly detection algorithm which leads 
to a computationally efficient solution with provable optimality guarantees. Our algorithm takes a 
K-nearest neighbor graph as an input and produces a score for each test point. Scores turn out to be 
empirical estimates of the volume of minimum volume level sets containing the test point. While 
minimum volume level sets provide an optimal characterization for anomaly detection, they are 
high dimensional quantities and generally difficult to reliably compute in high dimensional feature 
spaces. Nevertheless, a sufficient statistic for optimal tradeoff between false alarms and misses is 
the volume of the MV set itself, which is a real number. By computing score functions we avoid 
computing high dimensional quantities and still ensure optimal control of false alarms and misses. 
The computational cost of our algorithm scales linearly in dimension and quadratically in data size. 